These notes are based on Chapter 12 of Jones, O., Maillardet, R., & Robinson, A. (2009). Introduction to scientific programming and simulation using R. CRC Press.
Optimisation concerns the search for the maximum (or minimum) of a function. Formally, a simple optimisation problem can be written quite compactly as follows: \[ \max_{x\in\mathbb{R}^k} f(x) \quad\text{or}\quad \arg\max_{x\in\mathbb{R}^k} f(x) .\] The former returns the (global) maximum, while the latter returns the set of (global) maximisers. In other words, if \(x^*\in \arg\max f(x)\), then \(f(x^*) = \max f(x)\) and \(f(x)\leq f(x^*)\) for all \(x\in\mathbb{R}^k\).
In these notes, we discuss only maximisation problems because any minimisation problem can readily be transformed into a maximisation problem as follows: \[ \min_{x\in\mathbb{R}^k} f(x) = (-1)\max_{x\in\mathbb{R}^k} -f(x).\]
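To make this concrete, here is a quick sanity check using R's built-in optimize (which minimises by default); the quadratic \(f(x)=(x-2)^2\) is just an illustrative example, not one of our test problems.

```r
f <- function(x) (x - 2)^2
# Minimising f directly...
opt.min <- optimize(f, c(0, 5))
# ...agrees with maximising -f and flipping the sign of the optimum back.
opt.max <- optimize(function(x) -f(x), c(0, 5), maximum = TRUE)
opt.min$minimum     # near 2
-opt.max$objective  # near 0, the minimum value of f
```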
Global maximisers are often difficult to find because our knowledge of \(f\) is only local: an algorithm examines a limited number of points, so we usually do not have a good picture of the overall shape of \(f\). For example, it is hard to find a global maximiser of \(f(x)=\sin(3\pi x)/x+x^3-5x^2+6x\) without a graph like the following.
curve(sin(3*pi*x)/x+x^3-5*x^2+6*x, from=.26, to=3.6, ylab="y", n=301)
title(expression(y==sin(3*pi*x)/x+x^3-5*x^2+6*x))
This may seem like a contrived case. It is not! Modern optimisation problems in practice often involve much more complex objective functions (e.g., non-parametric functions or deep neural networks).
Consequently, at the current state of knowledge, most algorithms are able to find only local maximisers. Formally, a point \(x^*\) is a local maximiser if there exists a neighbourhood \(\mathcal{N}\) of \(x^*\) such that \(f(x)\leq f(x^*)\) for all \(x\in\mathcal{N}\).
In these notes, we use two maximisation problems to test our algorithms. The first one: \[ \max_{x\in\mathcal{D}} f_1(x) = \frac{\sin(3\pi x)}{x}+x^3-5x^2+6x ,\] where \(\mathcal{D} = [0.2, 3.4]\).
Here is an R function that returns \(f_1\), \(f_1'\), and \(f_1''\).
test1 <- function(x) {
  f <- sin(3*pi*x)/x + x^3 - 5*x^2 + 6*x
  grad <- (3*pi*x*cos(3*pi*x) - sin(3*pi*x))/x^2 + 3*x^2 - 10*x + 6
  # NB: end the line with a trailing operator so the expression continues;
  # starting the next line with "+" would silently truncate H.
  H <- -(9*pi^2*x^2*sin(3*pi*x) + 2*(3*pi*x*cos(3*pi*x) - sin(3*pi*x)))/x^3 +
    6*x - 10
  return(c(f, grad, H))
}
The second one: \[ \max_{(x,y)\in\mathcal{D}} f_2(x,y) =\sin\left(\frac{x^2}{2} - \frac{y^2}{4}\right)\cos\left(2x-e^y\right) ,\] where \(\mathcal{D} = [-0.5, 3]\times[-0.5, 2]\). The following is a surface plot.
library(plotly)  # for plot_ly() and add_surface()
library(interp)  # for interp(); the akima package also provides it
# c.f. https://stackoverflow.com/a/45424943
X <- seq(-.5, 3, length=100)
Y <- seq(-.5, 2, length=100)
XY <- expand.grid(X=X, Y=Y) # create a grid
f2 <- function(x, y) {
  sin(x^2/2 - y^2/4)*cos(2*x - exp(y))
}
Z <- f2(XY$X, XY$Y)
s <- interp(x=XY$X, y=XY$Y, z=Z)
# https://stackoverflow.com/a/61118759 for z=t(s$z)
plot_ly(x=s$x, y=s$y, z=t(s$z)) %>% add_surface()
Here is an R function that returns \(f_2\), \(\nabla f_2\), and the Hessian of \(f_2\).
test2 <- function(xx) {
  x <- xx[1]
  y <- xx[2]
  f <- sin(x^2/2 - y^2/4)*cos(2*x - exp(y))
  f1 <- x*cos(2*x-exp(y))*cos(x^2/2-y^2/4) -
    2*sin(2*x-exp(y))*sin(x^2/2-y^2/4)
  f2 <- -(y*cos(2*x-exp(y))*cos(x^2/2-y^2/4))/2 +
    exp(y)*sin(2*x-exp(y))*sin(x^2/2-y^2/4)
  f11 <- cos(exp(y)-2*x)*cos(x^2/2-y^2/4) -
    x^2*sin(x^2/2-y^2/4)*cos(exp(y)-2*x) +
    4*x*sin(exp(y)-2*x)*cos(x^2/2-y^2/4) -
    4*sin(x^2/2-y^2/4)*cos(exp(y)-2*x)
  f12 <- -x*exp(y)*sin(exp(y)-2*x)*cos(x^2/2-y^2/4) -
    y*sin(exp(y)-2*x)*cos(x^2/2-y^2/4) +
    2*exp(y)*sin(x^2/2-y^2/4)*cos(exp(y)-2*x) +
    x*y*sin(x^2/2-y^2/4)*cos(exp(y)-2*x)/2
  f22 <- -exp(y)*sin(exp(y)-2*x)*sin(x^2/2-y^2/4) -
    cos(exp(y)-2*x)*cos(x^2/2-y^2/4)/2 -
    y^2*sin(x^2/2-y^2/4)*cos(exp(y)-2*x)/4 +
    exp(y)*y*sin(exp(y)-2*x)*cos(x^2/2-y^2/4) -
    exp(2*y)*sin(x^2/2-y^2/4)*cos(exp(y)-2*x)
  return(list(f, c(f1, f2), matrix(c(f11, f12, f12, f22), 2, 2)))
}
This is, to say the least, nasty. You may use the deriv function in R to compute the gradient and Hessian instead.
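For instance, R's built-in deriv() can differentiate \(f_1\) symbolically (a sketch; deriv() supports only a limited table of functions, but ours qualifies):

```r
# deriv() builds a function that returns f along with its gradient
# (and, with hessian = TRUE, second derivatives) as attributes.
d1 <- deriv(~ sin(3*pi*x)/x + x^3 - 5*x^2 + 6*x, "x",
            function.arg = "x", hessian = TRUE)
out <- d1(1.5)
attr(out, "gradient")  # f1'(1.5)
attr(out, "hessian")   # f1''(1.5)
```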
The golden-section method is based on the same “bracketing” idea used in the bisection method for root finding, and it does not require derivative information. We successively isolate smaller intervals (brackets) that contain a local maximum.
Whereas root finding brackets a root with two points, isolating a local maximum requires three: \(x_l\), \(x_r\), and \(x_m\) with \(x_m\in(x_l,x_r)\). A key observation is:
If \(f\) is continuous, then \(f(x_l) \leq f(x_m)\) and \(f(x_r) \leq f(x_m)\) together imply the existence of a local maximum in the interval \([x_l,x_r]\).
Similar to the bisection method, the golden-section method starts with some \(x_m\in(x_l,x_r)\) such that \(f(x_l) \leq f(x_m)\) and \(f(x_r) \leq f(x_m)\).
Below is the algorithm; \(f(x_m)\) serves as the running estimate of the local maximum and is updated until the stopping criterion is met.

1. If \(x_r - x_l \leq \epsilon\), stop and return \(f(x_m)\); otherwise, let \(I\) be the larger of the two subintervals \((x_l, x_m)\) and \((x_m, x_r)\).
2. Choose a new point \(y\in I\).
3. If \(f(y)\geq f(x_m)\), update the triple to \((x_m, y, x_r)\) when \(I=(x_m,x_r)\), or to \((x_l, y, x_m)\) when \(I=(x_l,x_m)\); otherwise, update it to \((x_l, x_m, y)\) or \((y, x_m, x_r)\), respectively. Go to Step #1.
The idea at Step #3 is to shrink the bracket while keeping the current estimate of the maximiser inside it. If \(f(y)<f(x_m)\), we cannot improve the estimate, but we can still shrink the bracket.
At Step #2 above, the golden ratio arises when we choose \(y\) so that the ratio of the lengths of the larger to the smaller subinterval stays the same from one iteration to the next. As an example, suppose we find ourselves in the following situation.
curve(-x^2, xaxt="n", yaxt="n", ann=FALSE, from=-1, to=1)
letters <- c(expression(x[l]),expression(x[m]),"y",expression(x[r]))
from_x <- c(-.9,-.1,.25,.9)
axis(1, from_x, labels=letters)
from_y <- rep(-1.2, 4)
to_x <- from_x
to_y <- sapply(from_x, function(x) return(-x^2))
segments(from_x, from_y, to_x, to_y, lty=3)
arrows(from_x[1], -.9, to_x[2], -.9, length=.1, code=3)
text((to_x[1]+to_x[2])/2, -.9, "a", pos=3)
arrows(from_x[2], -.9, to_x[4], -.9, length=.1, code=3)
text((to_x[2]+to_x[4])/2, -.9, "b", pos=3)
arrows(from_x[2], -.6, to_x[3], -.6, length=.1, code=3)
text((to_x[2]+to_x[3])/2, -.6, "c", pos=3)
points(to_x[2:3], to_y[2:3])
text(to_x[2], to_y[2]+.01, expression(f(x[m])), pos=2)
text(to_x[3], to_y[3]+.01, expression(f(y)), pos=4)
As you can see, the right subinterval is larger, so \(I = (x_m,x_r)\). If \(f(y)<f(x_m)\), as in the figure, the new bracket will be \((x_l, x_m, y)\), so we want to pick \(y\in I\) (equivalently, pick \(c\)) such that \(\frac{a}{c} = \frac{b}{a}\).
If instead the situation is as below, where \(y\) gives a better estimate of the maximum, the new bracket will be \((x_m, y, x_r)\), so we want to pick \(y\) (or \(c\)) such that \(\frac{b-c}{c} = \frac{b}{a}\) as \(f(y)\geq f(x_m)\).
curve(-x^2, xaxt="n", yaxt="n", ann=FALSE, from=-1, to=1)
letters <- c(expression(x[l]),expression(x[m]),"y",expression(x[r]))
from_x <- c(-.9,-.2,.1,.9)
axis(1, from_x, labels=letters)
from_y <- rep(-1.2, 4)
to_x <- from_x
to_y <- sapply(from_x, function(x) return(-x^2))
segments(from_x, from_y, to_x, to_y, lty=3)
arrows(from_x[1], -.9, to_x[2], -.9, length=.1, code=3)
text((to_x[1]+to_x[2])/2, -.9, "a", pos=3)
arrows(from_x[2], -.9, to_x[4], -.9, length=.1, code=3)
text((to_x[2]+to_x[4])/2, -.9, "b", pos=3)
arrows(from_x[2], -.6, to_x[3], -.6, length=.1, code=3)
text((to_x[2]+to_x[3])/2, -.6, "c", pos=3)
points(to_x[2:3], to_y[2:3])
text(to_x[2], to_y[2]+.01, expression(f(x[m])), pos=2)
text(to_x[3], to_y[3]+.01, expression(f(y)), pos=4)
So, even in the same \(I = (x_m,x_r)\) case, two situations can arise from our pick of \(y\), and we do not know which beforehand, because \(f(y)\) is unknown until we pick \(y\) and evaluate it. Our motivation, then, is to pick \(y\) (or \(c\)) so that the ratio is preserved no matter which scenario happens to arise.
Formally, pick \(c\) such that \[ \frac{a}{c} = \frac{b}{a} \quad \text{and} \quad \frac{b-c}{c} = \frac{b}{a} .\] Eliminating \(c\) via the first equation (\(c = a^2/b\)), we have \[\begin{align} & \frac{b-a^2/b}{a^2/b} = \frac{b}{a}\\ \Leftrightarrow\quad & \frac{b^2}{a^2} - \frac{b}{a} - 1 = 0\\ \Leftrightarrow\quad & \rho^2 - \rho - 1 = 0\\ \end{align}\] where \(\rho = b/a\). The positive root is the golden ratio \(\rho = \frac{1+\sqrt{5}}{2}\).
In short, we choose \[ y = x_m + c = x_m + \frac{a^2}{b} = x_m + \frac{a}{\rho} = x_m + \frac{2(x_m-x_l)}{1+\sqrt{5}}.\]
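As a quick numerical check (with a hypothetical bracket whose subintervals already satisfy \(b/a=\rho\)), either outcome of the update leaves the ratio of the larger to the smaller subinterval unchanged:

```r
rho <- (1 + sqrt(5))/2
x.l <- 0; x.m <- 1; x.r <- 1 + rho       # bracket with b/a = rho
a <- x.m - x.l
b <- x.r - x.m
y <- x.m + 2*(x.m - x.l)/(1 + sqrt(5))   # i.e., y = x.m + a/rho
cc <- y - x.m
a/cc         # ratio if f(y) <  f(x.m): new bracket (x.l, x.m, y)
(b - cc)/cc  # ratio if f(y) >= f(x.m): new bracket (x.m, y, x.r)
# both equal rho
```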
Work out what happens if the left subinterval is larger, so that \(I = (x_l,x_m)\).
Let’s apply it to \(f(x)=\sin(3\pi x)/x+x^3-5x^2+6x\).
gsection <- function(f, x.l, x.r, x.m, tol=1e-8) {
  f.l <- f(x.l)[1]
  f.r <- f(x.r)[1]
  f.m <- f(x.m)[1]
  gr <- (1 + sqrt(5))/2  # the golden ratio
  n <- 0
  while ((x.r - x.l) > tol) {
    n <- n + 1
    if ((x.r - x.m) > (x.m - x.l)) {  # right subinterval is larger
      y <- x.m + (x.m - x.l)/gr
      f.y <- f(y)[1]
      if (f.y >= f.m) {
        x.l <- x.m
        f.l <- f.m
        x.m <- y
        f.m <- f.y
      } else {
        x.r <- y
        f.r <- f.y
      }
    } else {  # left subinterval is larger
      y <- x.m - (x.r - x.m)/gr
      f.y <- f(y)[1]
      if (f.y >= f.m) {
        x.r <- x.m
        f.r <- f.m
        x.m <- y
        f.m <- f.y
      } else {
        x.l <- y
        f.l <- f.y
      }
    }
  }
  return(list("f"=f(x.m)[1], "x.m"=x.m, "Number of steps"=n))
}
gsection(test1, 1.1, 2.1, 1.7)
## $f
## [1] 1.851551
##
## $x.m
## [1] 1.455602
##
## $`Number of steps`
## [1] 39
Try other triples and see which local maxima you find.
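As a cross-check (a sketch using R's built-in optimize, which combines golden-section search with successive parabolic interpolation), searching the same interval should land on a local maximum of \(f_1\), though not necessarily the same one for every interval:

```r
f1 <- function(x) sin(3*pi*x)/x + x^3 - 5*x^2 + 6*x
opt <- optimize(f1, c(1.1, 2.1), maximum = TRUE)
opt$maximum    # a local maximiser in (1.1, 2.1)
opt$objective  # the corresponding local maximum
```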
Thanks to the popularity of machine learning, especially neural networks, gradient descent has been studied extensively as a class of optimisation methods.
Since we are doing maximisation, “gradient ascent” would be a more accurate name, but the term “gradient descent” is so common that we stick with it.
Recall that the gradient \(\nabla f\) is the vector of partial derivatives of \(f\); it points in the direction of fastest increase of \(f\), and its magnitude gives the rate of that increase.
When searching for a local maximum, it is almost natural, albeit naïve, to follow the direction of the gradient at each iteration. Hence the algorithm: \[ x_{n+1} = x_n + \alpha \nabla f(x_n).\]
Here \(\alpha\) is a step size that determines how far we ascend in the direction of \(\nabla f(x_n)\).
Remember that, unless \(f\) is linear, \(\nabla f\) is a function of \(x\), and \(\nabla f(x_n)\) no longer gives the steepest direction once we move away from \(x_n\). So, as a rule of thumb, we should take steps as small as our computational budget allows.
Let’s first apply it to the 1-D problem.
gd1 <- function(fg, x0, alpha=1e-2, tol=1e-8, n.max=1000) {
  x <- x0
  fg.x <- fg(x)
  n <- 0
  while ((abs(alpha*fg.x[2]) > tol) & (n < n.max)) {
    x <- x + alpha*fg.x[2]  # step in the direction of the derivative
    fg.x <- fg(x)
    n <- n + 1
  }
  return(list("f"=fg(x)[1], "x"=x, "Number of steps"=n))
}
Note the stopping criteria. Besides the maximum number of iterations, we use \(|\alpha f'(x)|\leq\epsilon\), because \(\alpha f'(x)\) is the actual step to be taken and \(f'(x^*)=0\) at a local maximum.
Try other stopping criteria such as \(|f(x_{n+1})-f(x_n)|\leq\epsilon\) and \(\|x_{n+1}-x_n\|\leq\epsilon\).
gd1(test1, 1.3)
## $f
## [1] 1.851551
##
## $x
## [1] 1.455602
##
## $`Number of steps`
## [1] 19
Identify all local maxima by changing initial points. You can eyeball the graph and determine points from which to hill-climb. What will happen if you start at \(x\leq0.4\) or \(x\geq3.1\)?
Next, to the 2-D problem.
gd2 <- function(fg, x0, alpha=1e-2, tol=1e-7, n.max=1e4) {
  x <- x0
  grad <- fg(x)[[2]]
  n <- 0
  while ((max(abs(alpha*grad)) > tol) & (n < n.max)) {
    x <- x + alpha*grad
    grad <- fg(x)[[2]]
    n <- n + 1
  }
  if (n < n.max) {
    return(list("f"=fg(x)[[1]], "x"=x, "Number of steps"=n))
  } else {
    return(NULL)
  }
}
gd2(test2, c(1.5, .4))
## $f
## [1] 1
##
## $x
## [1] 2.030692 1.401523
##
## $`Number of steps`
## [1] 724
Play with different \(\alpha\) and \(x_0\).
One-dimensional problems are relatively easy, and we need not worry too much about the step size. After all, there are only two directions to move, left or right, so as long as the step size is not excessively large, we just march on and get there.
However, “excessively large” may not be so obvious. To illustrate, let’s consider the following situation.
f <- function(x) -x*(x-2)*(x-4)*(x-5)
curve(f, from=-.3, to=5.5, ylab="y")
title(expression(y==-x^4+11*x^3-38*x^2+40*x))
x <- c(0.75, 2.93, 4.57)
points(x, f(x))
points(c(0,3.2), c(0,f(3.2)), pch=4)
arrows(-.15, -2, -.02, 3.5,length=0.1, angle=10)
text(3.2, f(3.2)-1.5, "?", pos=4)
The derivative is \(f'(x) = -4x^3+33x^2-76x+40\), which implies \(f'(x)=0\) at \(x\in\{0.75,2.93,4.57\}\).
Suppose \(x_n=0\), where \(f'(0)=40\). Thus, \(x_{n+1} = 40\alpha\), so if \(\alpha>0.074\), we will have \(x_{n+1}>2.93\) and climb up to the second local maximum \(x^*=4.57\). Missing \(x^*=0.75\) could be a serious problem, depending on the phenomenon being modelled.
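We can verify this numerically with the derivative \(f'(x) = -4x^3+33x^2-76x+40\) given above (a minimal sketch; the helper ascend and the iteration count are ad hoc choices for this illustration):

```r
fp <- function(x) -4*x^3 + 33*x^2 - 76*x + 40  # f'(x)
ascend <- function(x0, alpha, n = 1000) {
  x <- x0
  for (i in 1:n) x <- x + alpha*fp(x)          # plain gradient ascent
  x
}
ascend(0, 0.01)   # converges to the first local maximum, near 0.75
ascend(0, 0.075)  # overshoots past x = 2.93 and climbs to near 4.57
```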
The issue is much more difficult to handle in higher dimensions, where there are infinitely many directions to choose from and we could wander around for a long time before approaching any local maximum. Of course, if we always use a very small \(\alpha\), progress might be too slow to be practical.
Suppose that we are happy to take a step in the direction of the gradient. Then, a natural solution to step size selection is to choose: \[\alpha^* \in \arg\max_{\alpha\geq 0}g(\alpha) = \arg\max_{\alpha\geq 0} f(x_n + \alpha \nabla f(x_n)) .\] This is another maximisation problem, which could itself be difficult to solve exactly. Nonetheless, it is a 1-D problem, and there are ways to obtain good approximate solutions.
Remember that this maximisation problem with respect to \(\alpha\) is solved at each iteration of the original maximisation problem.
Here, we try to find an exact local maximiser \(\alpha^*\) using optimize, one of R’s built-in optimisers.
In practice, the vast majority of people use built-in functions, as we need not reinvent the wheel. Nonetheless, for learning, it is important to understand basic algorithms and implement them by hand. So, try other techniques here, including your own code for the golden-section method!
First, the 1-D problem.
line.search1 <- function(fg, x0, tol=1e-7, n.max=1e3) {
  # g(alpha) = f(x + alpha*f'(x)), the objective of the inner 1-D problem
  g <- function(alpha, x, grad) fg(x + alpha*grad)[1]
  x <- x0
  grad <- fg(x)[2]
  alpha <- optimize(g, c(0, .1), x=x, grad=grad, maximum=TRUE)$maximum
  n <- 0
  while ((abs(alpha*grad) > tol) & (n < n.max)) {
    x <- x + alpha*grad
    grad <- fg(x)[2]
    alpha <- optimize(g, c(0, .1), x=x, grad=grad, maximum=TRUE)$maximum
    n <- n + 1
  }
  if (n < n.max) {
    return(list("f"=fg(x)[[1]], "x"=x, "Number of steps"=n))
  } else {
    return(NULL)
  }
}
line.search1(test1, 1.3)
## $f
## [1] 1.851551
##
## $x
## [1] 1.455602
##
## $`Number of steps`
## [1] 2
Next, the 2-D problem.
line.search2 <- function(fg, x0, tol=1e-8, n.max=1e4) {
  g <- function(alpha, x, grad) fg(x + alpha*grad)[[1]]
  x <- x0
  grad <- fg(x)[[2]]
  alpha <- optimize(g, c(0, .1), x=x, grad=grad, maximum=TRUE)$maximum
  n <- 0
  while ((max(abs(alpha*grad)) > tol) & (n < n.max)) {
    x <- x + alpha*grad
    grad <- fg(x)[[2]]
    alpha <- optimize(g, c(0, .1), x=x, grad=grad, maximum=TRUE)$maximum
    n <- n + 1
  }
  if (n < n.max) {
    return(list("f"=fg(x)[[1]], "x"=x, "Number of steps"=n))
  } else {
    return(NULL)
  }
}
line.search2(test2, c(1.5, .6))
## $f
## [1] 1
##
## $x
## [1] 2.030697 1.401526
##
## $`Number of steps`
## [1] 89
Setting aside the cost of the inner optimisation, we see a significant reduction in the number of steps taken to reach local maxima.
In 1-D problems, Newton’s method is simply to apply the Newton-Raphson method to find a root of \(f'\). This makes sense because \(f'(x^*)=0\) is a necessary condition for a local maximum (with a catch of being merely “necessary”). Therefore, assuming that \(f\) is twice continuously differentiable, we have: \[ x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)} .\]
For some reason, when the Newton-Raphson method is applied to optimisation, it is just called Newton’s method. Poor Joseph Raphson…
To generalise it to multi-dimensional problems, let’s first recall the key idea behind the Newton-Raphson method. When iteratively searching for a root of \(g\), the method uses a linear approximation of \(g\) at the current guess \(x_n\), because it is easy to find the root of a linear function. The approximated root then becomes the next guess \(x_{n+1}\).
Now, imagine a vector-valued function with \(k\) inputs and \(k\) outputs: \(g:\mathbb{R}^k\to\mathbb{R}^k\). In this case, the equivalents of root and derivative are respectively \(0\) vector and the Jacobian matrix (\(J\)). Doing the same approximation at \(x_n\in\mathbb{R}^k\), we get: \[ g(x) \approx g(x_n) + J(x_n)(x-x_n) .\] Setting the left-hand side equal to \(0\) and rearranging, we derive: \[ x_{n+1} = x_n - J(x_n)^{-1}g(x_n) .\]
We implicitly assume that \(g\) is continuously differentiable and that \(J(x_n)\) is invertible (i.e., has a non-zero determinant).
Based on this, if we assume the objective function \(f\) is twice continuously differentiable, then we have multi-dimensional Newton’s method: \[ x_{n+1} = x_n - H(x_n)^{-1}\nabla f(x_n) ,\] where \(H\) is the Hessian matrix of \(f\).
To see the correspondence, remember that for \(f:\mathbb{R}^k\to\mathbb{R}\), \(\nabla f\) is a vector-valued function and its Jacobian is the Hessian of \(f\).
Another derivation is to maximise the second-order Taylor approximation of \(f\) around \(x_n\): \[ f(x) \approx f(x_n) + \nabla f(x_n)^{\top}(x-x_n) + \frac{1}{2}(x-x_n)^{\top}H(x_n)(x-x_n) .\] Setting the gradient of the right-hand side with respect to \(x\) equal to \(0\) gives \(\nabla f(x_n) + H(x_n)(x-x_n) = 0\), which yields the same update.
Let’s apply Newton’s method to the 2-D testbed.
newton <- function(fgh, x0, tol=1e-8, n.max=100) {
  x <- x0
  grad <- fgh(x)[[2]]
  H <- fgh(x)[[3]]
  n <- 0
  while ((max(abs(grad)) > tol) & (n < n.max)) {
    x <- x - solve(H, grad)  # solve H d = grad rather than inverting H
    grad <- fgh(x)[[2]]
    H <- fgh(x)[[3]]
    n <- n + 1
  }
  if (n < n.max) {
    return(list("f"=fgh(x)[[1]], "x"=x, "Number of steps"=n))
  } else {
    return(NULL)
  }
}
newton(test2, c(1.5, .6))
## $f
## [1] 1.253638e-20
##
## $x
## [1] 9.899083e-10 1.366392e-09
##
## $`Number of steps`
## [1] 7
Experimenting with various \(x_0\) on this test function, you will find that Newton’s method is extremely sensitive to the choice of \(x_0\) and not very reliable.
Since the method is based on finding a root of \(\nabla f\), it may converge to wherever the gradient vanishes: a minimum, a maximum, or a saddle point.
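To see this concretely in one dimension (a sketch applying the 1-D Newton update to \(f_1\), with derivatives copied from test1), we can classify the limit point by the sign of \(f''\) there:

```r
fp  <- function(x) (3*pi*x*cos(3*pi*x) - sin(3*pi*x))/x^2 + 3*x^2 - 10*x + 6
fpp <- function(x) -(9*pi^2*x^2*sin(3*pi*x) +
                     2*(3*pi*x*cos(3*pi*x) - sin(3*pi*x)))/x^3 + 6*x - 10
newton1 <- function(x0, tol = 1e-10, n.max = 100) {
  x <- x0
  for (n in 1:n.max) {
    x.new <- x - fp(x)/fpp(x)  # Newton-Raphson on f'
    if (abs(x.new - x) < tol) return(x.new)
    x <- x.new
  }
  x
}
x.star <- newton1(1.5)  # converges to the critical point near 1.456 ...
fpp(x.star)             # ... where f'' < 0, so it is a local maximum
```

Starting points near a minimum of \(f_1\) instead converge there, with \(f''>0\) at the limit.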
When Newton’s method works, its convergence is quick, often taking only a fraction of the steps required by line search. For example, it works for \(f(x,y)=x(x-1)(x+1) - y(y-1)(y+1)\), plotted below. The price for speed is the requirement of second derivatives, which may be infeasible to obtain in complex problems.
X <- seq(-1, 1, length=100)
Y <- seq(-1, 1, length=100)
XY <- expand.grid(X=X, Y=Y) # create a grid
f2 <- function(x, y) {
  x*(x-1)*(x+1) - y*(y-1)*(y+1)
}
Z <- f2(XY$X, XY$Y)
s <- interp(x=XY$X, y=XY$Y, z=Z)
plot_ly(x=s$x, y=s$y, z=t(s$z)) %>% add_surface()